Conversation
Introduces CodecChain, a frozen dataclass that chains array-array, array-bytes, and bytes-bytes codecs with synchronous encode/decode methods. Pure compute only -- no IO, no threading, no batching. Also adds sync roundtrip tests for individual codecs (blosc, gzip, zstd, crc32c, bytes, transpose, vlen) and CodecChain integration tests.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
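To make the description above concrete, here is a minimal, hypothetical sketch of the idea: a frozen dataclass that composes codecs and applies them synchronously, with no IO or orchestration. The names (`SyncCodec`, `GzipLike`) and method signatures are illustrative stand-ins, not zarr's actual API.

```python
import zlib
from dataclasses import dataclass
from typing import Protocol


class SyncCodec(Protocol):
    """Hypothetical protocol: any codec with sync encode/decode."""

    def encode_sync(self, data: bytes) -> bytes: ...
    def decode_sync(self, data: bytes) -> bytes: ...


class GzipLike:
    """Stand-in bytes->bytes codec backed by zlib."""

    def encode_sync(self, data: bytes) -> bytes:
        return zlib.compress(data)

    def decode_sync(self, data: bytes) -> bytes:
        return zlib.decompress(data)


@dataclass(frozen=True)
class CodecChain:
    codecs: tuple[SyncCodec, ...]

    def encode(self, data: bytes) -> bytes:
        # Pure compute: apply each codec in order.
        for codec in self.codecs:
            data = codec.encode_sync(data)
        return data

    def decode(self, data: bytes) -> bytes:
        # Invert the chain by applying codecs in reverse order.
        for codec in reversed(self.codecs):
            data = codec.decode_sync(data)
        return data


chain = CodecChain(codecs=(GzipLike(),))
payload = b"hello" * 100
assert chain.decode(chain.encode(payload)) == payload
```

An empty chain is the identity transform, which falls out of the loop structure for free.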
updating this PR to blend the CodecChain logic with array_spec logic to define a …
Force-pushed from c0be8be to 71a780b
# Conflicts:
#   src/zarr/abc/store.py
#   src/zarr/storage/_common.py
#   src/zarr/storage/_local.py
#   src/zarr/testing/store.py
#   tests/test_codecs/test_zstd.py
this PR is necessary for the broader performance plan because it allows us to pipeline codecs that support synchronous execution (i.e., all of them, practically).
@TomAugspurger any interest in reviewing this PR? It's part of a series that culminates in the performance improvements evident in #3719
TomAugspurger left a comment:
Big +1 to the general design and the implementation looks nice. Just minor nitpicks that can be addressed / ignored as you prefer.
src/zarr/core/codec_pipeline.py (outdated diff):

    ArrayArrayCodec transforms — i.e. the spec that feeds the
    ArrayBytesCodec.

    All codecs must implement ``SupportsSyncCodec``. Construction will
Idle thought without having yet looked at the implementation: make Codec generic over SupportsSyncCodec (via a protocol) so that this can be caught before runtime?
I want to avoid touching the codec ABC, so maybe we defer this for a later effort
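The reviewer's idea could look roughly like the following: a `runtime_checkable` protocol lets construction reject codecs lacking sync methods at runtime, while annotating with the protocol lets a type checker flag them statically. All names here are a hypothetical sketch, not zarr's actual codec ABC.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsSyncCodec(Protocol):
    """Hypothetical protocol for codecs offering sync encode/decode."""

    def _encode_sync(self, data: object) -> object: ...
    def _decode_sync(self, data: object) -> object: ...


class GoodCodec:
    def _encode_sync(self, data: object) -> object:
        return data

    def _decode_sync(self, data: object) -> object:
        return data


class AsyncOnlyCodec:
    async def encode(self, data: object) -> object:
        return data


# isinstance on a runtime_checkable Protocol checks method presence,
# so a constructor can validate its codecs before any work happens.
assert isinstance(GoodCodec(), SupportsSyncCodec)
assert not isinstance(AsyncOnlyCodec(), SupportsSyncCodec)
```

Note that `isinstance` against a protocol only checks that the methods exist, not their signatures; the static checker covers the rest.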
src/zarr/core/codec_pipeline.py (outdated diff):

    self._bb_codecs = bb

    @property
    def shape(self) -> tuple[int, ...]:
How are shape and dtype ultimately used? They're a bit complicated to understand. Presumably you need them for your full perf PR, but I wanted to confirm that.
I'm on the fence about these attributes actually. I wanted to model the fact that the output of a ChunkTransform is an array, with a fixed shape and dtype, and the fact that the array -> array codecs can be thought of as layers of ChunkTransform objects. But I don't know if we actually need this. Maybe a richer return type annotation is a better way of conveying this information.
these attributes are gone, we can add them later if they are actually valuable
src/zarr/core/codec_pipeline.py (outdated diff):

    """
    bb_out: Any = chunk_bytes
    for bb_codec in reversed(self._bb_codecs):
        bb_out = bb_codec._decode_sync(bb_out, self._ab_spec)  # type: ignore[attr-defined]
Can you add a comment about why the type: ignore is needed? Presumably it relates to that isinstance(c, SupportsSyncCodec) check above, which mypy can't see here in decode?
the type: ignore is gone thanks to making the protocol generic.
src/zarr/core/codec_pipeline.py (outdated diff):

    for aa_codec, spec in reversed(self._aa_codecs):
        ab_out = aa_codec._decode_sync(ab_out, spec)  # type: ignore[attr-defined]

    return ab_out  # type: ignore[no-any-return]
This type: ignore I don't understand. Is the Any up above accurate, or should this be NDBuffer | Buffer like I see here? And if it's supposed to be NDBuffer | Buffer, why do we declare we return -> NDBuffer?
this was because the protocol for synchronous encoding / decoding was not generic over input and output types. I fixed that, so the type: ignore statement is gone
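The fix described above can be sketched like this: making the sync-codec protocol generic over input and output types means the checker can follow the loop's types, so no `type: ignore` or cast is needed. Names and type parameters here are illustrative, not zarr's.

```python
from typing import Protocol, TypeVar

# Input is contravariant (a codec accepting a broader input still fits);
# output is covariant (a codec returning a narrower output still fits).
In = TypeVar("In", contravariant=True)
Out = TypeVar("Out", covariant=True)


class SupportsSyncDecode(Protocol[In, Out]):
    """Hypothetical generic protocol for sync decoding."""

    def _decode_sync(self, data: In) -> Out: ...


class BytesToText:
    """Stand-in codec: decodes raw bytes to a str."""

    def _decode_sync(self, data: bytes) -> str:
        return data.decode("utf-8")


def decode_with(codec: SupportsSyncDecode[bytes, str], raw: bytes) -> str:
    # The checker knows the return type is str; no ignore needed.
    return codec._decode_sync(raw)


assert decode_with(BytesToText(), b"abc") == "abc"
```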
tests/test_sync_codec_pipeline.py (outdated diff):

    def test_encode_decode_roundtrip_bytes_only(self) -> None:
        # Minimal round-trip: BytesCodec serializes the array to bytes and back.
        # No compression, no AA transform.
        arr = np.arange(100, dtype="float64")
        spec = _make_array_spec(arr.shape, arr.dtype)
        chain = ChunkTransform(codecs=(BytesCodec(),), array_spec=spec)
        nd_buf = _make_nd_buffer(arr)
FWIW, I'd be fine with consolidating these tests with the construction tests. I do like having focused construction tests when the constructors are complicated, but these seem simple enough that just seeing the traceback pointing to __post_init__ should be enough. Either works for me though.
good idea. I kept the encoding / decoding separate from the constructor tests, but I also made the constructor tests much more compact.
tests/test_sync_codec_pipeline.py (outdated diff):

    encoded = chain.encode(nd_buf)
    assert encoded is not None
    decoded = chain.decode(encoded)
    np.testing.assert_array_equal(arr, decoded.as_numpy_array())
Might be worth refactoring this to a test helper, assert_encode_decode_equal(...).
Also, not worth worrying about currently, arr is possibly not a NumPy array, depending on what default_buffer_prototype() returns, in which case np.testing.assert_array_equal might not work. But to solve that more generally is out of scope.
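The suggested helper could look roughly like this. `FakeChain` is a hypothetical stand-in here so the sketch is self-contained; the real helper would take a ChunkTransform and an NDBuffer.

```python
import zlib
from typing import Optional


class FakeChain:
    """Stand-in for a codec chain with the same encode/decode shape."""

    def encode(self, data: bytes) -> Optional[bytes]:
        return zlib.compress(data)

    def decode(self, data: bytes) -> bytes:
        return zlib.decompress(data)


def assert_encode_decode_equal(chain: FakeChain, data: bytes) -> None:
    # One helper captures the encode -> not-None -> decode -> compare
    # pattern repeated across the round-trip tests.
    encoded = chain.encode(data)
    assert encoded is not None
    decoded = chain.decode(encoded)
    assert decoded == data


assert_encode_decode_equal(FakeChain(), b"roundtrip" * 10)
```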
src/zarr/core/codec_pipeline.py (outdated diff):

    if aa_out is None:
        return None
    bb_out: Any = self._ab_codec._encode_sync(aa_out, self._ab_spec)  # type: ignore[attr-defined]
This is quite hard to read/review: "is the output of aa after application of ab really bb?!" Doesn't help that output is Any despite all the fancy typing.
aa_out -> asarray; bb_out -> asbytes?
ok so it's an issue with the variable names, I will see if I can make them more clear
the properties are an issue too but I don't have a suggestion for that.
[points accusingly at the v3 spec] making 3 different kinds of functions all "codecs" was maybe not the best choice. We could use the "filters, serializer, compressors" trinity used in create_array, but I worry that this is a bit too far from the language used by the spec.
I'm going to punt on this for now. I legit think the naming issues here are a basic flaw in the v3 spec.
src/zarr/core/codec_pipeline.py (outdated diff):

    if bb_out is None:
        return None
could be hoisted out of the loop
or maybe not? I'm confused regardless.
the whole thing with encoding potentially returning none is problematic and needs to go. Unfortunately that's based on the codec ABC which I am not touching right now :(
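The hoisting question above can be sketched in isolation. If `None` can only enter the loop through a codec's return value, a per-iteration check is equivalent to checking once after the loop; if the initial input may be `None`, the check belongs before the loop. Everything here is hypothetical scaffolding, not zarr's actual encode path.

```python
from collections.abc import Callable, Iterable
from typing import Optional


def encode_all(
    codecs: Iterable[Callable[[bytes], Optional[bytes]]],
    data: Optional[bytes],
) -> Optional[bytes]:
    for codec in codecs:
        # Checking inside the loop short-circuits as soon as any codec
        # (or the initial input) produced None...
        if data is None:
            return None
        data = codec(data)
    # ...but if codecs never return None, a single check hoisted out of
    # the loop is equivalent and easier to read.
    return data
```

This is exactly why an encode API that can return `None` is awkward: every caller inherits the ambiguity.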
src/zarr/core/codec_pipeline.py (outdated diff):

    if aa_out is None:
        return None
could be moved out of the loop? Can _encode_sync return None?
Can _encode_sync return None?
unfortunately yes, in keeping with methods already defined on the Codec API.
thanks for the super helpful feedback @TomAugspurger and @dcherian! I made a lot of simplifying changes. LMK if anything else needs to be done, otherwise I will merge and move on to the next step of the performance work.
This PR defines a CodecChain object, which is similar to CodecPipeline except that it performs no orchestration. It just defines array-spec-aware, pure-compute encoding and decoding routines for collections of codecs. Part of #3720. Depends on #3721.

edit: CodecChain is now called ChunkTransform